-
Notifications
You must be signed in to change notification settings - Fork 50
Parse PCRE callouts, backtracking directives, and .NET balanced captures #117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@swift-ci please test Linux |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM
.empty, .groupTransform: | ||
return false | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's used to emit a diagnostic for backreference directives that aren't quantifiable:
aea6d9d#diff-4f8c3a04c147c57dc383b837f459ee51c8bec6de78d061d54ac7e228f5228dc5R1328-R1334
extension AST.Atom { | ||
public struct Callout: Hashable { | ||
public enum Argument: Hashable { | ||
case number(Int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these references? That is, could they be relative or are they always absolute?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No they're just plain arguments to the callout function. As I understand it, PCRE expects you to give it a single callout function, and it gets called with the argument specified to let you differentiate which call it is.
/// (*SKIP) | ||
case skip | ||
/// (*THEN) | ||
case then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a newline to separate cases? Is there anywhere in the repo where these terms or concepts are defined or have a quick blurb?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, and no there currently is not. Should I elaborate on the comments here? Or split out a separate document to explain some of the more obscure features?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking the comment at first, but I do think a document sounds like a really good idea. That can serve as fodder for pitches too. I suspect the very first fully-realized pitch will be the literal-interior syntax.
} | ||
// '(?C)' is implicitly '(?C0)'. | ||
if src.peek() == ")" { | ||
return .number(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want a bit somewhere tracking the implicit-ness?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could probably be derived from the source info, but yeah it might be worthwhile to store a bit to indicate this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm planning on refactoring the callout AST type as a part of handling the Oniguruma-specific callout syntax, so I'll take care of this in that patch
@@ -44,7 +44,10 @@ extension AST { | |||
return .atom() + innerCaptures | |||
case .namedCapture(let name): | |||
return .atom(name: name.value) + innerCaptures | |||
case .balancedCapture(let b): | |||
return .atom(name: b.name?.value) + innerCaptures |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CC @rxwei who is fixing some bugs around this area
Parse the `(?C)` syntax with an integer or string argument. This doesn't yet handle the Oniguruma specific callout syntax, which is a little more involved.
This requires generalizing `canLexGroupLikeAtom` a bit to treat all `(*` groups as being atoms, and as such we need to special-case the PCRE2 explicit group syntax. We do it this way around to accommodate the extended Oniguruma callout syntax which also uses `(*`, which we aim to support.
This requires imposing some restrictions on what can be used as a group name to allow for the syntax `(?<a-b>)`. For now, restrict the characters to letters, numbers and `_`, and forbid the first character from being a number. This should be no stricter than the rules imposed by PCRE, Oniguruma, ICU, Java and .NET.
07cb802
to
df6be0c
Compare
@swift-ci please test Linux |
Parse PCRE callouts
(?C)
with an integer or string argument. This doesn't yet handle the Oniguruma-specific callout syntax, which is a little more involved.Parse PCRE backtracking directives e.g
(*ACCEPT)
. This requires generalizingcanLexGroupLikeAtom
a bit to treat all(*
groups as being atoms, and as such we need to special-case the PCRE2 explicit group syntax. We do it this way around to accommodate the extended Oniguruma callout syntax which also uses(*
, which we aim to support.Parse .NET balanced captures. This requires imposing some restrictions on what can be used as a group name to allow for the syntax
(?<a-b>)
. For now, restrict the characters to letters, numbers and_
, and forbid the first character from being a number. This should be no stricter than the rules imposed by PCRE, Oniguruma, ICU, Java and .NET.